Chapter 5 - New Developments: Topic Modeling with BERTopic!#
2022 July 30

What is BERTopic?#
As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”
Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”
For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” share almost no tokens, yet convey a similar meaning.
Ideally, we would like to extract generalized topics, rather than specific words or phrases, to get an idea of what a document is about.
This is where BERTopic comes in! BERTopic is a cutting-edge technique that combines the transformer embeddings underlying BERT with other ML tools to provide a flexible and powerful topic modeling module (with great visualization support as well!)
In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
Required installs:#
# Installs the base bertopic module:
!pip install bertopic
# If you want to use other transformers/language backends, it may require additional installs:
# !pip install bertopic[flair] # can substitute 'flair' with 'gensim', 'spacy', 'use'
# bertopic also comes with its own handy visualization suite:
# !pip install bertopic[visualization]
Collecting bertopic
Using cached bertopic-0.11.0-py2.py3-none-any.whl (76 kB)
...
Successfully built hdbscan sentence-transformers umap-learn pynndescent
Installing collected packages: tokenizers, sentencepiece, torch, tenacity, pyyaml, numpy, llvmlite, filelock, cython, torchvision, plotly, numba, huggingface-hub, transformers, sentence-transformers, pynndescent, hdbscan, umap-learn, bertopic
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pycontractions 2.0.1 requires language-check>=1.0, which is not installed.
Successfully installed bertopic-0.11.0 cython-0.29.32 filelock-3.8.0 hdbscan-0.8.28 huggingface-hub-0.8.1 llvmlite-0.39.0 numba-0.56.0 numpy-1.22.4 plotly-5.10.0 pynndescent-0.5.7 pyyaml-5.4.1 sentence-transformers-2.2.2 sentencepiece-0.1.97 tenacity-8.0.1 tokenizers-0.12.1 torch-1.12.1 torchvision-0.13.1 transformers-4.21.1 umap-learn-0.5.3
Data sourcing#
For this exercise, we’re going to use a popular dataset, ‘20 Newsgroups,’ which contains roughly 18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:
import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
print(documents[0]) # Any ice hockey fans?
---------------------------------------------------------------------------
ContextualVersionConflict Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 import bertopic
...
ContextualVersionConflict: (numpy 1.23.1 (/Users/evanmuzzall/.local/share/virtualenvs/SSDS-TAML-xaUfvlpM/lib/python3.9/site-packages), Requirement.parse('numpy<1.23,>=1.18'), {'numba'})
Note: this error occurs because numpy 1.23.1 was already loaded before the install downgraded numpy to 1.22.4 to satisfy numba’s numpy<1.23 requirement. Restarting the kernel after installation, so that the newly installed numpy is loaded, should resolve it.
Creating a BERTopic model:#
Using the BERTopic module requires you to fetch an instance of the model. When doing so, you can specify multiple different parameters including:
language -> the language of your documents
min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics
embedding_model -> what model you want to use to conduct your word embeddings; many are supported!
For a full list of the parameters and their significance, please see https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py.
Of course, you can always use the default parameter values and instantiate your model as model = BERTopic(). Once you’ve done so, you’re ready to fit your model to your documents!
Example instantiation:#
from sklearn.feature_extraction.text import CountVectorizer
# Example parameter: a custom vectorizer model can be used to remove stopwords from the documents:
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
# Instantiate the model:
model = BERTopic(vectorizer_model=stopwords_vectorizer)
Fitting the model:#
The first step of topic modeling is to fit the model to the documents:
topics, probs = model.fit_transform(documents)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
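The warning above can be silenced up front by setting the environment variable it mentions before any Hugging Face tokenizer is loaded, for example:

```python
import os

# Disable tokenizer parallelism before the tokenizers library is imported,
# as the warning message suggests:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```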
.fit_transform() returns two outputs:
topics contains a mapping of each input (document) to its modeled topic (alternatively, cluster)
probs contains a list of probabilities that an input belongs to its assigned topic
Note:
fit_transform() can be substituted with fit(). fit_transform() allows for the prediction of new documents but demands additional computing power/time.
Viewing topic modeling results:#
The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:
# view your topics:
topics_info = model.get_topic_info()
# get detailed information about the top five most common topics:
print(topics_info.head(5))
Topic Count Name
0 -1 6646 -1_file_use_need_using
1 0 1838 0_team_games_players_season
2 1 616 1_clipper_encryption_chip_nsa
3 2 527 2_cheek ken_ken huh_ignore art_huh ignore
4 3 452 3_israel_israeli_jews_palestinian
When examining topic information, you may see a topic with the assigned number -1. Topic -1 collects all outlier inputs that were not assigned to any topic, and it should typically be ignored during analysis.
Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
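For example, assuming topics is the assignment list returned by fit_transform(), outlier documents can be excluded before further analysis. A sketch with toy data:

```python
# Toy topic assignments of the kind fit_transform() returns; -1 marks outliers:
topics = [0, 1, -1, 0, -1, 2]

# Keep only documents that were assigned to a real topic:
assigned = [t for t in topics if t != -1]
print(assigned)  # [0, 1, 0, 2]
```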
# access a single topic:
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
# get representative documents for a specific topic:
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics
["\ni have no idea, nor do i care. however, i'd like to point out that\nblomberg got the first plate appearance by a designated hitter, and\nthe first walk by a designated hitter. i am not sure, but i do not\nthink that he also got the first hit by a designated hitter.", ": >\n: >ATLANTIC DIVISION\n: >\t\n: >\tST JOHN'S MAPLE LEAFS VS MONCTON HAWKS\n: >\tMONCTON HAWKS\n: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,\n: >defensive, good goaltending. John Leblanc and Stu Barnes are the only\n: >noticable guns on the team. But the defense is top notch and \n: >Mike O'Neill is the most underrated goalie in the league.\n: >\n\n: Bri, as I have tried to tell you since 2 February, Michael O'Neill\n: might be the most underrated goalie in the AHL, but he ISN'T in the\n: AHL. He's on the Winnipeg Jets' injury list, as he has been since\n: his first NHL start against the Ottawa Senators. He's out until\n: next year after surgery to repair a shoulder separation.\n\n: Stu Barnes might be an AHL gun for the Hawks, but he's now the third\n: line center with the Jets, and has been since mid January or so.\n\nSorry, my memory is gone. I thought that O'Neill got sent back\ndown in February but I must have been given incorrect info. I guess\nthis says it all about Moncton because Barnes is still one of\ntheir top 3 or so scorers even though he's been out since January.", "\n\nI didn't see any smilies in this message so.......\n\n W T L PTs\n Team A 50 30 4 104\n Team B 52 32 0 104\n\n\nThere you go. Two teams that tie in points without identical records.\n\n"]
# find topics similar to a key term/phrase:
topics, similarity_scores = model.find_topics("sports", top_n = 5)
print("Most common topics:" + str(topics)) # view the numbers of the top-5 most similar topics
# print the initial contents of the most similar topics
for topic_num in topics:
print('\nContents from topic number: '+ str(topic_num) + '\n')
print(model.get_topic(topic_num))
Most common topics:[0, 30, 6, 166, 4]
Contents from topic number: 0
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
Contents from topic number: 30
[('games', 0.03260548961663573), ('sega', 0.02366315012814771), ('arcade', 0.012166539858844822), ('snes', 0.010883627526511617), ('sega genesis', 0.01081910740506706), ('joysticks', 0.010294764495945618), ('games sale', 0.010085068481475858), ('sale', 0.00964091677280479), ('joystick', 0.009006639792149954), ('sega cd', 0.0074012373591723)]
Contents from topic number: 6
[('riding', 0.011792240692170709), ('ride', 0.011256591323418531), ('driving', 0.007418204752466058), ('road', 0.007362304673149508), ('traffic', 0.006971330162717447), ('roads', 0.005093305390738552), ('bikes', 0.0046328368271995445), ('bikers', 0.0041220512073587194), ('riders', 0.0037367046265679754), ('passengers', 0.0035386604055364823)]
Contents from topic number: 166
[('religion', 0.024810151190057972), ('war', 0.01958713595572545), ('wars', 0.0141305144151792), ('crusades', 0.012827683749926261), ('history', 0.01202363443416338), ('religious', 0.009458363539211138), ('unbelievers', 0.008338773663764506), ('yoked unbelievers', 0.007970064155940823), ('statement religion', 0.007495172035922859), ('gods', 0.0071255212864334274)]
Contents from topic number: 4
[('health', 0.0072259305085357), ('cancer', 0.005975505039095839), ('disease', 0.00513078203584376), ('tobacco', 0.005069613472607038), ('medical', 0.00492433353954727), ('hiv', 0.004709304265420622), ('malaria', 0.004112010029452724), ('smokeless tobacco', 0.004033769948845448), ('lyme', 0.003923377448522405), ('medical newsletter', 0.003903230753928965)]
Saving/loading models:#
One of the most obvious drawbacks of the BERTopic technique is the algorithm’s run-time. Rather than re-fitting a model every time you want to conduct topic modeling analysis, you can simply save and load models!
# save your model:
# model.save("TAML_ex_model")
# load it later:
# loaded_model = BERTopic.load("TAML_ex_model")
Visualizing topics:#
Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.
Depending on the visualization, it can even reveal patterns that would be much harder, or even impossible, to see through textual analysis alone - like inter-topic distance!
Let’s see some examples!
# Create a 2D representation of your modeled topics & their pairwise distances:
model.visualize_topics()
# Get the words and probabilities of top topics, but in bar chart form!
model.visualize_barchart()
# Evaluate topic similarity through a heat map:
model.visualize_heatmap()
Conclusion#
Hopefully you’re convinced of how accessible but powerful a technique BERTopic topic modeling can be! There’s plenty more to learn about BERTopic than what we’ve covered here, but you should be ready to get started!
During your adventures, you may find the following resources useful:
Original BERTopic Github: https://github.com/MaartenGr/BERTopic
BERTopic visualization guide: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-terms
How to use BERT to make a custom topic model: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6
Recommended things to look into next include:
how to select the best embedding model for your BERTopic model;
controlling the number of topics your model generates; and
other visualizations and deciding which ones are best for what kinds of documents.
Questions? Please reach out! Anthony Weng, SSDS consultant, is happy to help (contact: ad2weng@stanford.edu)